Finally, after $K \times T$ samples, where $T$ is a hyperparameter, we calculate the confidence with the UCB according to Eq. 4.8 as
\[
s_U\big(o_k^{(i,j)}\big) = m_{k,t}^{(i,j)} + \sqrt{\frac{2 \log N}{n_{k,t}^{(i,j)}}}.
\tag{4.12}
\]

The operation with the minimal UCB on every edge is abandoned. This means that operations that are given more opportunities but result in poor performance are removed. With this pruning strategy, the search space is significantly reduced from $|\Omega^{(i,j)}|^{10 \times 6}$ to $(|\Omega^{(i,j)}| - 1)^{10 \times 6}$, and the reduced space becomes
\[
\Omega^{(i,j)} \leftarrow \Omega^{(i,j)} - \Big\{ \arg\min_{o_k^{(i,j)}} s_U\big(o_k^{(i,j)}\big) \Big\}, \quad \forall (i, j).
\tag{4.13}
\]

The reduction procedure is repeated until the optimal structure is obtained, where only one

operation is left on each edge.
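
As a concrete illustration, a minimal Python sketch of one reduction round implementing Eqs. 4.12 and 4.13 is given below; the containers `ops`, `m`, `n` and the counter `N` are illustrative assumptions for the sketch, not the reference implementation, which maintains these statistics during the anti-bandit sampling phase.

```python
import numpy as np

def prune_one_round(ops, m, n, N):
    """One reduction round (Eqs. 4.12-4.13): on every edge (i, j), drop the
    candidate operation with the smallest upper confidence bound.

    ops : dict mapping edge (i, j) -> list of remaining candidate operations
    m   : dict mapping edge (i, j) -> np.ndarray of value estimates m_{k,t}
    n   : dict mapping edge (i, j) -> np.ndarray of sample counts  n_{k,t}
    N   : total number of samples drawn so far
    """
    for edge in ops:
        # Eq. 4.12: s_U(o_k) = m_k + sqrt(2 * log(N) / n_k)
        ucb = m[edge] + np.sqrt(2.0 * np.log(N) / n[edge])
        worst = int(np.argmin(ucb))  # given opportunities, but still performing poorly
        # Eq. 4.13: remove that operation from the edge's candidate set
        ops[edge].pop(worst)
        m[edge] = np.delete(m[edge], worst)
        n[edge] = np.delete(n[edge], worst)
    return ops, m, n
```

Repeating this round until a single operation remains on every edge yields the final cell structure.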

Complexity Analysis. There are $O(K^{|E_M| \times v})$ combinations in the search space discovery process with $v$ types of different cells. In contrast, ABanditNAS reduces the search space every $K \times T$ epochs. Therefore, the complexity of the proposed method is the following:

\[
O\Big(T \times \sum_{k=2}^{K} k\Big) = O(TK^2).
\tag{4.14}
\]
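
Reading each term $k$ as one reduction stage with $k$ candidate operations still in play, the quadratic bound follows directly from the arithmetic series:
\[
T \times \sum_{k=2}^{K} k = T\left(\frac{K(K+1)}{2} - 1\right) = O(TK^2).
\]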

4.2.4 Adversarial Optimization

The goal of adversarial training [167] is to learn networks that are robust to adversarial attacks. Given a network $f_\theta$ parameterized by $\theta$, a dataset $(x_e, y_e)$, a loss function $\ell$ and a threat model $\Delta$, the learning problem can be formulated as the following optimization problem: $\min_\theta \sum_e \max_{\delta \in \Delta} \ell\big(f_\theta(x_e + \delta), y_e\big)$, where $\delta$ is the adversarial perturbation. In this chapter, we consider the typical $\ell_\infty$ threat model [167], $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$ for some $\epsilon > 0$. Here, $\|\cdot\|_\infty$ is the $\ell_\infty$-norm distance metric and $\epsilon$ is the adversarial manipulation budget.

The adversarial training procedure uses attacks to approximate the inner maximization over $\Delta$, followed by some variation of gradient descent on the model parameters $\theta$. For example, one of the earliest versions of adversarial training uses the Fast Gradient Sign Method (FGSM) [75] to approximate the inner maximization. This can be seen as a relatively inaccurate approximation of the inner maximization for $\ell_\infty$ perturbations, and it has the closed-form solution $\delta = \epsilon \cdot \mathrm{sign}\big(\nabla_x \ell(f(x), y)\big)$.
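
As a point of reference, this closed-form step can be written in a few lines of PyTorch; `model`, `x`, `y`, and `epsilon` below are illustrative placeholders rather than the notation of Algorithm 8.

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, epsilon):
    """One-step FGSM: delta = epsilon * sign(grad_x loss(f(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return epsilon * grad.sign()
```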

A better approximation of the inner maximization is to take multiple smaller FGSM steps of size $\alpha$ instead. However, the number of gradient computations caused by the multiple steps is proportional to $O(EF)$ in a single epoch, where $E$ is the size of the dataset and $F$ is the number of steps taken by the PGD adversary. This is $F$ times higher than standard training, which needs $O(E)$ gradient computations per epoch, so adversarial training is typically $F$ times slower. To accelerate adversarial training, we combine FGSM with random initialization [247] in our ABanditNAS. ABanditNAS with adversarial training is summarized in Algorithm 8.
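
For illustration, the sketch below shows the FGSM-with-random-initialization recipe of [247] as a plain PyTorch training epoch; the step size `alpha`, the clamping to a $[0, 1]$ input range, and the function names are assumptions of the sketch, and Algorithm 8 additionally interleaves this training with the anti-bandit architecture search.

```python
import torch
import torch.nn.functional as F

def fast_adv_epoch(model, loader, optimizer, epsilon, alpha, device="cpu"):
    """One epoch of FGSM adversarial training with random initialization [247].

    For each batch: start from a uniform perturbation in [-epsilon, epsilon],
    take a single FGSM step of size alpha, project back onto the l_inf ball,
    then update the model parameters on the perturbed examples.
    """
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        # Random start inside the threat model Delta = {delta : ||delta||_inf <= epsilon}.
        delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)

        # Single FGSM step of size alpha on the perturbation.
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon).detach()

        # Keep the adversarial example in the valid input range (assumed [0, 1]).
        x_adv = (x + delta).clamp(0.0, 1.0)

        # Outer minimization: standard gradient descent on the model parameters.
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```

The random start inside the $\ell_\infty$ ball is what distinguishes this from plain FGSM training while keeping the cost of a single gradient computation per batch.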

4.2.5 Analysis

Effect of the hyperparameter λ. The hyperparameter λ balances the contributions of past and current performance. Different values of λ result in similar search costs. The performance of the structures searched by ABanditNAS with different values of λ is used